Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing

Authors

  • Youngbae Kim
  • James S. Plank
  • Jack J. Dongarra
Abstract

Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, since fault tolerance is incorporated into the matrix operations, the matrix operations become resilient to any single processor failure or change with low overhead. In this paper, we present a technique called multiple checkpointing to enable the matrix operations to tolerate a certain set of multiple processor failures by adding the capacity for multiple checkpointing processors. The results on a network of workstations have shown that this technique improves not only the reliability of the computation but also the performance of checkpointing.
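The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a sum-based checksum held in memory by each checkpointing processor, and it assumes the computing processors are split into groups with one checkpointing processor per group, so that one failure per group can be tolerated. The names `checksum`, `make_groups`, `take_checkpoints`, and `recover` are hypothetical, and integer data is used only to keep the arithmetic exact.

```python
from typing import Dict, List, Set

Vector = List[int]


def checksum(blocks: List[Vector]) -> Vector:
    """Element-wise sum of equally sized blocks (the in-memory checkpoint)."""
    return [sum(column) for column in zip(*blocks)]


def make_groups(num_procs: int, num_ckpt: int) -> List[List[int]]:
    """Assumed layout: assign processor ids round-robin to num_ckpt groups."""
    return [list(range(g, num_procs, num_ckpt)) for g in range(num_ckpt)]


def take_checkpoints(data: Dict[int, Vector],
                     groups: List[List[int]]) -> List[Vector]:
    """Each checkpointing processor stores the checksum of its own group."""
    return [checksum([data[p] for p in group]) for group in groups]


def recover(data: Dict[int, Vector], groups: List[List[int]],
            checkpoints: List[Vector], failed: Set[int]) -> Dict[int, Vector]:
    """Rebuild the blocks of failed processors.

    Works only if every group lost at most one processor; otherwise the
    single checksum per group does not carry enough information.
    """
    restored: Dict[int, Vector] = {}
    for group, ckpt in zip(groups, checkpoints):
        lost = [p for p in group if p in failed]
        if len(lost) > 1:
            raise ValueError(f"group {group} lost more than one processor")
        if lost:
            rebuilt = list(ckpt)
            # Subtract every surviving block from the group's checksum.
            for p in group:
                if p not in failed:
                    rebuilt = [r - b for r, b in zip(rebuilt, data[p])]
            restored[lost[0]] = rebuilt
    return restored


if __name__ == "__main__":
    data = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}
    groups = make_groups(num_procs=4, num_ckpt=2)
    checkpoints = take_checkpoints(data, groups)
    print(recover(data, groups, checkpoints, failed={0, 1}))
```

Under these assumptions, each checkpointing processor sums fewer blocks than a single one would, which is one plausible reading of why adding checkpointing processors could also reduce checkpointing cost.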


Similar resources

Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless chec...


Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations

This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the "Network Of Workstation" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific c...


STAR: A Fault-Tolerant System for Distributed Applications

This paper presents a fault-tolerant manager for distributed applications. This manager provides an efficient recovery of hosts’ failures on networks of workstations. An independent checkpointing is used to automatically recover application processes affected by host failures. Domino-effects are avoided by means of message logging and file versions management. STAR provides an efficient softwar...


Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...


Fault-tolerant Distributed Applications In LiPS

Performing computations using networks of workstations is increasingly becoming an alternative to using a supercomputer. This approach is motivated by the vast quantities of unused idle time available in workstation networks. Unlike computing on a tightly coupled parallel computer, where a fixed number of processor nodes is used within a computation, the number of usable nodes in a workstati...




Publication date: 1997